Abstract. Hidden Markov Models (HMMs) are learning methods for pattern 
recognition. The probabilistic HMMs have been one of the most used techniques 
based on the Bayesian model. First-order probabilistic HMMs were adapted to 
the theory of belief functions such that Bayesian probabilities were replaced with 
mass functions. In this paper, we present a second-order Hidden Markov Model 
using belief functions. Previous works in belief HMMs have been focused on the 
first-order HMMs. We extend them to the second-order model. 
1 Introduction 

A Hidden Markov Model (HMM) is one of the most important statistical models in
machine learning [?]. An HMM is a classifier or labeler that can assign a label or class
to each unit in a sequence [?]. It has been successfully used over several decades in many
applications for processing text and speech, such as Part-of-Speech (POS) tagging [?],
named entity recognition [?] and speech recognition [?]. However, the early works of
this period are mainly based on first-order HMMs. In fact, the assumption of the
first-order HMM, where the state transition and output observation depend only on one
previous state, does not exactly match real applications [?]. Such models therefore
require a number of refinements. For example, even though the first-order HMM for POS
tagging in the early 1990s performs reasonably well, it captures a more limited amount
of contextual information than is available [?]. As a consequence, most modern
statistical POS taggers use a second-order model [?].

Uncertainty theories can be integrated into statistical models such as HMMs. Probability
theory has been used to classify units in a sequence with the Bayesian model. The theory
of belief functions has then been applied to this statistical model because the fusion
proposed in this theory simplifies computations of a posteriori distributions of hidden
data in Markov models. This theory provides rules to combine evidence from different
sources to reach a certain level of belief [?,?,?,?,?]. Belief HMMs, introduced in
[?,?,?,?,?,?,?,?,?], use combination rules proposed in the framework of the theory of
belief functions. This paper extends these previous ideas to second-order belief HMMs.
In the current work, we focus on explaining a second-order model. However, the proposed
method can be easily extended to higher-order models.
This paper is organized as follows: In Sections 2 and 3, we detail probabilistic
HMMs for the problem of POS tagging, where HMMs have been widely used. Then,
we describe the first-order belief HMM in Section 4. Finally, before concluding, we
propose the second-order belief HMM.


2 First-order probabilistic HMMs 

POS tagging is a task of finding the most probable estimated sequence of n tags given 
the observation sequence of v words. According to [?], a first-order probabilistic HMM 
can be characterized as follows: 

N The number of states in the model, with $S_t = \{s_1^t, s_2^t, \ldots, s_N^t\}$ at time $t$.

M The number of distinct observation symbols, with $V = \{v_1, v_2, \ldots, v_M\}$.

$A = \{a_{ij}\}$ The set of N transition probability distributions.

$B = \{b_j(o_t)\}$ The observation probability distributions in state $j$.

$\Pi = \{\pi_i\}$ The initial probability distribution.

Figure 1 illustrates the first-order probabilistic HMM, which estimates the probability
of the sequence $s_i^{t-1}$ and $s_j^t$, where $a_{ij}$ is the transition probability from
$s_i^{t-1}$ to $s_j^t$ and $b_j(o_t)$ is the observation probability in the state $s_j^t$.
Regarding POS tagging, the number of possible POS tags, which are the hidden states $S_t$
of the HMM, is $N$. The number of words in the lexicon $V$ is $M$. The transition
probability $a_{ij}$ is the probability that the model moves from one tag $s_i^{t-1}$ to
another tag $s_j^t$. This probability can be estimated from a training data set in
supervised learning for the HMM. The probability of a current POS tag appearing in the
first-order HMM depends only on the previous tag. In general, first-order probabilistic
HMMs are characterized by three fundamental problems as follows [?]:

- Likelihood: Given a set of transition probability distributions $A$, an observation
sequence $O = o_1, o_2, \ldots, o_T$ and its observation probability distribution $B$, how
do we determine the likelihood $P(O|A,B)$? The first-order model relies on only one
observation, where $b_j(o_t) = P(o_t|s_j^t)$, and on the transition probability based on
one previous tag, where $a_{ij} = P(s_j^t|s_i^{t-1})$. Using the forward path probability,
the likelihood $\alpha_t(j)$ of a given state $s_j^t$ can be computed by using the
likelihood $\alpha_{t-1}(i)$ of the previous state $s_i^{t-1}$ as described below:

$$\alpha_t(j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad (1)$$

- Decoding: Given a set of transition probability distributions $A$, an observation
sequence $O = o_1, o_2, \ldots, o_T$ and its observation probability distribution $B$, how
do we discover the best hidden state sequence? The Viterbi algorithm is widely used
for calculating the most likely tag sequence for the decoding problem. The Viterbi
algorithm computes the probability $\delta_t(j)$ of the most probable path ending in
state $j$ at time $t$, together with the sequence of backpointers $\psi_t(j)$. It selects
the path that maximizes the likelihood of the sequence as described below:

$$\delta_t(j) = \max_i \delta_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

$$\psi_t(j) = \arg\max_i \delta_{t-1}(i)\, a_{ij} \qquad (2)$$

- Learning: Given an observation sequence $O = o_1, o_2, \ldots, o_T$ and a set of states
$S = \{s_1, s_2, \ldots, s_N\}$, how do we learn the HMM parameters $A$ and $B$? The
parameter learning task usually uses the Baum-Welch algorithm, which is a special case
of the Expectation-Maximization (EM) algorithm.
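
The forward and Viterbi recursions of the likelihood and decoding problems can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary layout of `pi`, `A` and `B`, the initialisation with `pi`, and the toy two-tag model are assumptions made here, not part of the cited formulations.

```python
def forward_likelihood(obs, states, pi, A, B):
    """Likelihood P(O | A, B) of a non-empty sequence via the forward recursion."""
    # Initialisation (assumed): alpha_1(j) = pi_j * b_j(o_1).
    alpha = {j: pi[j] * B[j].get(obs[0], 0.0) for j in states}
    for o in obs[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha = {
            j: sum(alpha[i] * A[i][j] for i in states) * B[j].get(o, 0.0)
            for j in states
        }
    return sum(alpha.values())


def viterbi(obs, states, pi, A, B):
    """Most probable tag sequence for a non-empty observation sequence."""
    delta = {j: pi[j] * B[j].get(obs[0], 0.0) for j in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, psi = {}, {}
        for j in states:
            # Best predecessor i of j: argmax_i delta_{t-1}(i) * a_ij
            best_i = max(states, key=lambda i: delta[i] * A[i][j])
            psi[j] = best_i
            new_delta[j] = delta[best_i] * A[best_i][j] * B[j].get(o, 0.0)
        backpointers.append(psi)
        delta = new_delta
    # Follow the backpointers from the best final state.
    path = [max(states, key=lambda j: delta[j])]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    return path[::-1]


# Hypothetical toy model with two tags and two words.
states = ["DET", "NOUN"]
pi = {"DET": 0.8, "NOUN": 0.2}
A = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.4, "NOUN": 0.6}}
B = {"DET": {"the": 0.9, "dog": 0.1}, "NOUN": {"the": 0.1, "dog": 0.9}}

print(forward_likelihood(["the", "dog"], states, pi, A, B))
print(viterbi(["the", "dog"], states, pi, A, B))
```

Both functions share the same loop structure; only the sum is replaced by a max and a backpointer, which is why the two problems are usually presented together.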

In this paper, we focus on the likelihood and decoding problems by assuming a 
supervised learning paradigm where labeled training data are already available. 
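
Under the supervised paradigm, the parameters can be read off a tagged corpus as relative frequencies. The following Python sketch illustrates this; the function name `estimate_hmm`, the `(word, tag)` input format and the toy corpus are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of pi, A and B from (word, tag) sentences."""
    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            emit[tag][word] += 1          # counts for b_j(o_t)
            if prev is None:
                init[tag] += 1            # counts for the initial distribution
            else:
                trans[prev][tag] += 1     # counts for a_ij
            prev = tag

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalise(init)
    A = {tag: normalise(counts) for tag, counts in trans.items()}
    B = {tag: normalise(counts) for tag, counts in emit.items()}
    return pi, A, B


# Hypothetical two-sentence corpus.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN")],
]
pi_hat, A_hat, B_hat = estimate_hmm(corpus)
print(pi_hat, A_hat, B_hat)
```

Tags that never occur as a predecessor simply have no row in `A`; a practical tagger would need smoothing for such unseen events.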


3 Second-order probabilistic HMMs 

Now, we explain the extension of the first-order model to a trigram (see footnote 2) in
the second-order model. Figure 2 illustrates the second-order probabilistic HMM, which
estimates the probability of the sequence of three states $s_i^{t-2}$, $s_j^{t-1}$ and
$s_k^t$, where $a_{ijk}$ is the transition probability from $s_i^{t-2}$ and $s_j^{t-1}$ to
$s_k^t$, and $b_k(o_t)$ is the observation probability in the state $s_k^t$. The
second-order probabilistic HMM is likewise characterized by three fundamental problems
as follows:

- Likelihood: The second-order model relies on one observation $b_k(o_t)$. Unlike the
first-order model, the transition probability is based on two previous tags, where
$a_{ijk} = P(s_k^t|s_i^{t-2}, s_j^{t-1})$, as described below:

$$\alpha_t(k) = \sum_j \alpha_{t-1}(j)\, a_{ijk}\, b_k(o_t) \qquad (3)$$

However, it is more difficult to find a sequence of three tags than a sequence
of two tags. Any particular sequence of tags $s_i^{t-2}$, $s_j^{t-1}$, $s_k^t$ that occurs
in the test set may simply never have occurred in the training set because of data
sparsity [?]. Therefore, a method for estimating $P(s_k^t|s_i^{t-2}, s_j^{t-1})$, even if
the sequence $s_i^{t-2}$, $s_j^{t-1}$, $s_k^t$ never occurs, is required. The simplest
method to solve this problem is to combine the trigram
$\hat{P}(s_k^t|s_i^{t-2}, s_j^{t-1})$, the bigram $\hat{P}(s_k^t|s_j^{t-1})$ and even the
unigram $\hat{P}(s_k^t)$ probabilities [?]:

$$P(s_k^t|s_i^{t-2}, s_j^{t-1}) = \lambda_1 \hat{P}(s_k^t|s_i^{t-2}, s_j^{t-1}) + \lambda_2 \hat{P}(s_k^t|s_j^{t-1}) + \lambda_3 \hat{P}(s_k^t) \qquad (4)$$

Note that $\hat{P}$ denotes the maximum likelihood probabilities, which are derived from
the relative frequencies of the sequences of tags. The values of $\lambda$ are such that
$\lambda_1 + \lambda_2 + \lambda_3 = 1$, and they can be estimated by the deleted
interpolation algorithm [?]. Alternatively, [?] describes a different method for the
values of $\lambda$ as below:

$$\lambda_1 = k_3, \qquad \lambda_2 = (1 - k_3) \cdot k_2, \qquad \lambda_3 = (1 - k_3) \cdot (1 - k_1) \qquad (5)$$

where

$$k_2 = \frac{\log(C(s_j^{t-1}, s_k^t) + 1) + 1}{\log(C(s_j^{t-1}, s_k^t) + 1) + 2}, \qquad k_3 = \frac{\log(C(s_i^{t-2}, s_j^{t-1}, s_k^t) + 1) + 1}{\log(C(s_i^{t-2}, s_j^{t-1}, s_k^t) + 1) + 2},$$

and $C(s_i^{t-2}, s_j^{t-1}, s_k^t)$ is the frequency of a sequence $s_i^{t-2}$,
$s_j^{t-1}$, $s_k^t$ in the training data. Note that $\lambda_1 + \lambda_2 + \lambda_3$ is
not always equal to one in [?]. The likelihood of the observation probability for the
second-order model uses $B$ where $b_k(o_t) = P(o_t|s_k^t, s_j^{t-1})$.

2 The trigram is the sequence of three elements, i.e. three states in our case.

- Decoding: For the second-order model, we require a different Viterbi algorithm. For a
given state $s$ at time $t$, it is redefined as follows [?]:

$$\delta_t(k) = \max_j \delta_{t-1}(j)\, a_{ijk}\, b_k(o_t)$$

where $\delta_t(j) = \max P(s^1, s^2, \ldots, s^{t-1} = s_i, s^t = s_j, o_1, o_2, \ldots, o_t)$, and

$$\psi_t(k) = \arg\max_j \delta_{t-1}(j)\, a_{ijk} \qquad (6)$$

where $\psi_t(j) = \arg\max P(s^1, s^2, \ldots, s^{t-1} = s_i, s^t = s_j, o_1, o_2, \ldots, o_t)$.

- Learning: The problem of learning would be similar to the first-order model except 
that parameters A and B are different. 
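
The interpolation of trigram, bigram and unigram estimates in the likelihood problem can be illustrated with a short Python sketch. The function names, the toy tag sequences and the fixed weights (0.6, 0.3, 0.1) are assumptions made for illustration only. The helper `k_weight` shows the log-count ratio used for the count-dependent weights; the natural logarithm is an assumption, since the base is not stated.

```python
import math
from collections import Counter


def ngram_counts(tag_sequences):
    """Unigram, bigram and trigram counts over sequences of tags."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for idx, tag in enumerate(tags):
            uni[tag] += 1
            if idx >= 1:
                bi[tags[idx - 1], tag] += 1
            if idx >= 2:
                tri[tags[idx - 2], tags[idx - 1], tag] += 1
    return uni, bi, tri


def interpolated(t1, t2, t3, uni, bi, tri, lambdas):
    """Weighted sum of trigram, bigram and unigram relative frequencies."""
    l1, l2, l3 = lambdas

    def ratio(num, den):
        return num / den if den else 0.0   # unseen context: estimate is 0

    p_tri = ratio(tri[t1, t2, t3], bi[t1, t2])
    p_bi = ratio(bi[t2, t3], uni[t2])
    p_uni = ratio(uni[t3], sum(uni.values()))
    return l1 * p_tri + l2 * p_bi + l3 * p_uni


def k_weight(count):
    """(log(C + 1) + 1) / (log(C + 1) + 2) for an n-gram count C."""
    return (math.log(count + 1) + 1) / (math.log(count + 1) + 2)


# Hypothetical training tag sequences.
uni, bi, tri = ngram_counts([["DET", "NOUN", "VERB"], ["DET", "NOUN", "NOUN"]])
weights = (0.6, 0.3, 0.1)
print(interpolated("DET", "NOUN", "VERB", uni, bi, tri, weights))
print(interpolated("NOUN", "NOUN", "VERB", uni, bi, tri, weights))  # unseen trigram
```

The second call shows the point of the interpolation: the trigram was never observed, yet the estimate stays positive thanks to the bigram and unigram terms.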


With respect to performance, the different transition probability distributions
in [?] and [?] obtain 97.0% and 97.09% tagging accuracy for known words, respectively,
on the same data (the Penn Treebank corpus). Even though probabilistic HMMs
perform reasonably well, belief HMMs can learn better under certain conditions on
observations [?].
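
A second-order Viterbi decoder can be sketched in Python by keeping one score per pair of consecutive tags. This is a sketch under stated assumptions rather than the cited algorithm: the pair-indexed tables, the use of first-order parameters `pi` and `A1` for the first two positions, the function name and the toy model are all hypothetical.

```python
def viterbi_second_order(obs, states, pi, A1, A2, B):
    """Most probable tags for a non-empty sequence with trigram transitions A2."""
    if len(obs) == 1:
        return [max(states, key=lambda j: pi[j] * B[j].get(obs[0], 0.0))]
    # delta[i, j]: best score of a path whose last two tags are i, j.
    # The first two positions use first-order parameters (an assumption).
    delta = {
        (i, j): pi[i] * B[i].get(obs[0], 0.0) * A1[i][j] * B[j].get(obs[1], 0.0)
        for i in states for j in states
    }
    backpointers = []
    for o in obs[2:]:
        new_delta, psi = {}, {}
        for j in states:
            for k in states:
                # Best tag i two positions back: argmax_i delta(i, j) * a_ijk
                best_i = max(states, key=lambda i: delta[i, j] * A2[i, j][k])
                psi[j, k] = best_i
                new_delta[j, k] = (delta[best_i, j] * A2[best_i, j][k]
                                   * B[k].get(o, 0.0))
        backpointers.append(psi)
        delta = new_delta
    j, k = max(delta, key=delta.get)
    path = [k, j]                              # built backwards
    for psi in reversed(backpointers):
        path.append(psi[path[-1], path[-2]])   # key is (earlier tag, later tag)
    return path[::-1]


# Hypothetical toy model: after the pair (D, N) another N is very likely.
tags = ["D", "N"]
pi2 = {"D": 0.9, "N": 0.1}
A1 = {"D": {"D": 0.1, "N": 0.9}, "N": {"D": 0.5, "N": 0.5}}
A2 = {(i, j): {"D": 0.5, "N": 0.5} for i in tags for j in tags}
A2["D", "N"] = {"D": 0.1, "N": 0.9}
B2 = {"D": {"the": 0.9, "dog": 0.05, "food": 0.05},
      "N": {"the": 0.1, "dog": 0.45, "food": 0.45}}
print(viterbi_second_order(["the", "dog", "food"], tags, pi2, A1, A2, B2))
```

Indexing the scores by tag pairs makes the trellis quadratic in the number of tags, which is the usual price of the longer context.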


4 First-order Belief HMMs 

In probabilistic HMMs, $A$ and $B$ are probabilities estimated from the training data.
However, $A$ and $B$ in belief HMMs are mass functions (bbas) [?,?]. According to previous
works on belief HMMs, a first-order HMM using belief functions can be characterized
as follows:

N The number of states in the model $\Omega_t = \{S_1^t, S_2^t, \ldots, S_N^t\}$.

M The number of distinct observation symbols $V$.

$A = \{m_a^{\Omega_t}[S_i^{t-1}](S_j^t)\}$ The set of conditional bbas to all possible subsets of states.

$B = \{m_b^{\Omega_t}[o_t](S_j^t)\}$ The set of bbas according to all possible observations $O_t$.

$\Pi = \{m_\pi^{\Omega_1}(S_i^{\Omega_1})\}$ The bba defined for the initial state.

The difference between the first-order probabilistic and belief HMMs is presented in
Figure 1: the transition and observation probabilities in belief HMMs are described as
mass functions. Therefore, we can replace $a_{ij}$ by $m_a^{\Omega_t}[S_i^{t-1}](S_j^t)$ and
$b_j(o_t)$ by $m_b^{\Omega_t}[o_t](S_j^t)$. The set $\Omega_t$ has been used to denote states
for HMMs using belief functions [?,?]. Note that $s_j^t$ is the single state for
probabilistic HMMs and $S_j^t$ is the multi-valued state for belief HMMs. First-order
belief HMMs are also characterized by three fundamental problems as follows:

- Likelihood: The likelihood problem in belief HMMs is not solved by computing a
likelihood, but by combining mass functions. The first-order belief model relies on
(i) only one observation $m_b^{\Omega_t}[o_t](S_j^t)$ and (ii) a transition conditional
mass function based on one previous tag $m_a^{\Omega_t}[S_i^{t-1}](S_j^t)$. Mass functions
of sets $A$ and $B$ are combined using the Disjunctive Rule of Combination (DRC) for the
forward propagation and the Generalized Bayesian Theorem (GBT) for the backward
propagation [?]. Using the forward path propagation, the mass function of a given state
$S_j^t$ can be computed as the combination of mass functions on the observation and the
transition as described below:

$$q_\alpha^{\Omega_t}(S_j^t) = \sum_{S_i^{t-1} \subseteq \Omega_{t-1}} m_\alpha^{\Omega_{t-1}}(S_i^{t-1}) \cdot q_a^{\Omega_t}[S_i^{t-1}](S_j^t) \cdot q_b^{\Omega_t}(S_j^t) \qquad (7)$$

Note that the mass function of the given state $S_j^t$ is derived from the commonality
function $q_\alpha^{\Omega_t}$.

4 In the model $\Omega_t$, the $S_j^t$ are focal elements.

- Decoding: Several solutions have been proposed to extend the Viterbi algorithm to
the theory of belief functions [?,?,?]. Such solutions maximize the plausibility of
the state sequence. The credal Viterbi algorithm starts from the first observation
and estimates the commonality distribution of each observation until reaching
the last state. For each state $S_j^t$, the estimated commonality distribution
$q_\delta^{\Omega_t}(S_j^t)$ is converted back to a mass function that is conditioned on
the previous state. Then, we apply the pignistic transform to make a decision about the
current state $\psi_t(s_j^t)$:


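$$q_\delta^{\Omega_t}(S_j^t) = \sum_{S_i^{t-1} \subseteq A_{t-1}} m_\delta^{\Omega_{t-1}}(S_i^{t-1}) \cdot q_a^{\Omega_t}[S_i^{t-1}](S_j^t) \cdot q_b^{\Omega_t}(S_j^t)$$

$$\psi_t(s_j^t) = \arg\max_{S_i^{t-1} \in \Omega_{t-1}} \left(1 - m_\delta^{\Omega_t}[S_i^{t-1}](\emptyset)\right) \cdot P_t[S_i^{t-1}](S_j^t) \qquad (8)$$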


where A' = U 5,1-1 ^ ¥i(S'j) [?]. 

- Learning: Instead of the traditional EM algorithm, we can use the $E^2M$ algorithm
for the belief HMM [?].
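
The basic operations behind these three problems (commonality functions, the return from a commonality function to a mass function, and the pignistic transform) can be sketched in Python on a small frame. The representation of bbas as dictionaries keyed by `frozenset`, all function names, and the reading of the forward step as a sum over previous subsets multiplied by the observation commonality are assumptions made for illustration; the result of `credal_forward_step` is left unnormalised.

```python
from itertools import combinations


def subsets(frame):
    """All subsets of the frame as frozensets, the empty set included."""
    items = sorted(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def commonality(mass, frame):
    """q(A) = sum of m(B) over all focal sets B that contain A."""
    return {a: sum(v for b, v in mass.items() if a <= b) for a in subsets(frame)}


def mass_from_commonality(q):
    """Moebius inversion: m(A) = sum over B containing A of (-1)^(|B|-|A|) q(B)."""
    return {a: sum((-1) ** (len(b) - len(a)) * v for b, v in q.items() if a <= b)
            for a in q}


def pignistic(mass):
    """BetP(x) = sum over A containing x of m(A) / (|A| (1 - m(empty)))."""
    norm = 1.0 - mass.get(frozenset(), 0.0)   # assumes m(empty) < 1
    betp = {}
    for a, v in mass.items():
        for x in a:
            betp[x] = betp.get(x, 0.0) + v / (len(a) * norm)
    return betp


def credal_forward_step(m_prev, q_trans, q_obs):
    """q(S) = (sum over S_prev of m_prev(S_prev) * q_trans[S_prev](S)) * q_obs(S)."""
    return {s: sum(m_prev[sp] * q_trans[sp][s] for sp in m_prev) * q_obs[s]
            for s in q_obs}


# Hypothetical bba on the frame {N, V}: 0.6 on {N}, 0.4 on the whole frame.
N, NV = frozenset({"N"}), frozenset({"N", "V"})
m = {N: 0.6, NV: 0.4}
q = commonality(m, {"N", "V"})
print(mass_from_commonality(q))
print(pignistic(m))
```

Working with commonalities is convenient because the conjunctive combination of two bbas becomes a pointwise product, and the mass function is recovered only when a decision is needed.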


To build belief functions from what we learned using probabilities in the previous 
section, we can employ the least commitment principle by using the inverse pignistic 
transform [?,?]. 
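
One common choice of inverse pignistic transform builds the least committed consonant bba whose pignistic probability equals a given distribution. The Python sketch below implements that choice as an illustration; the function name and the toy distribution are hypothetical, and whether this particular transform is the most suitable one for belief HMMs remains to be checked empirically.

```python
def inverse_pignistic(prob):
    """Consonant bba whose pignistic transform gives back `prob`.

    With states ranked so that p_1 >= p_2 >= ... >= p_n and p_{n+1} = 0,
    the nested set of the i most probable states receives i * (p_i - p_{i+1}).
    """
    ranked = sorted(prob.items(), key=lambda kv: kv[1], reverse=True)
    mass, focal = {}, set()
    for rank, (state, p) in enumerate(ranked, start=1):
        focal.add(state)
        p_next = ranked[rank][1] if rank < len(ranked) else 0.0
        weight = rank * (p - p_next)
        if weight > 0:
            mass[frozenset(focal)] = weight
    return mass


# Hypothetical tag distribution estimated by a probabilistic HMM.
bba = inverse_pignistic({"N": 0.7, "V": 0.2, "D": 0.1})
for focal_set, value in bba.items():
    print(sorted(focal_set), value)
```

The resulting focal sets are nested, so the less peaked the probability distribution is, the more mass moves to larger sets of tags.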


5 Second-order Belief HMMs 

Like the first-order belief HMM, $V$, $M$, $B$ and $\Pi$ are similarly defined in the
second-order HMM. The set $A$ is quite different and is defined as follows:

$$A = \{m_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}](S_k^t)\} \qquad (9)$$

where $A$ is the set of conditional bbas to all possible subsets of states based on the
two previous states. Second-order belief HMMs are also characterized by three
fundamental problems as follows:

- Likelihood: The second-order belief model relies on one observation
$m_b^{\Omega_t}[o_t](S_k^t)$ in a state $S_k^t$ at time $t$ and the transition conditional
mass function based on two previous states $S_i^{t-2}$ and $S_j^{t-1}$, defined by
$m_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}](S_k^t)$. Using the forward path propagation, the mass
function of a given state $S_k^t$ can be computed as the disjunctive combination (DRC) of
mass functions on the transition $m_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}](S_k^t)$ and the
observation $m_b^{\Omega_t}(S_k^t)$ as described below:


( 10 ) 


where $q_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}](S_k^t)$ is the commonality function derived from
the conjunctive combination of mass functions of two previous transitions. The
conjunctive combination is used to have the conjunction of observations on the previous
two states $S_i^{t-2}$ and $S_j^{t-1}$.

The combined mass function $m_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}](S_k^t)$ of two transitions
$m_a^{\Omega_{t-1}}$ and $m_a^{\Omega_t}$ is defined as follows:






The conjunctive combination is required to obtain the conjunction of both transitions.
Note that the mass function of the given state $S_k^t$ is derived from the commonality
function $q_\alpha^{\Omega_t}$. We use DRC with commonality functions as in [?]. Note
that the observation on only one previous state is taken into account in the
first-order belief HMM, but the conjunction of observations on two previous states is
considered in the second-order belief HMM.

- Decoding: We keep the assumptions of the first-order belief HMM for the second-order
model. Similarly to the first-order belief HMM, we propose a solution that
maximizes the plausibility of the state sequence. The credal Viterbi algorithm
estimates the commonality distribution of each observation from the first observation
until the final state. For each state $S_k^t$, the estimated commonality distribution
$q_\delta^{\Omega_t}(S_k^t)$ is converted back to a mass function that is conditioned on a
mass function of the two previous states. This mass function is the conjunctive
combination of mass functions of the two previous states. Then, we apply the pignistic
transform to make a decision about the current state $\psi_t(s_k^t)$ as before:


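$$q_\delta^{\Omega_t}(S_k^t) = \sum_{S_j^{t-1} \subseteq A_{t-1}} m_\delta^{\Omega_{t-1}}(S_j^{t-1}) \cdot q_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}](S_k^t) \cdot q_b^{\Omega_t}(S_k^t)$$

$$\psi_t(s_k^t) = \arg\max_{S_j^{t-1} \in \Omega_{t-1}} \left(1 - m_\delta^{\Omega_t}[S_j^{t-1}](\emptyset)\right) \cdot P_t[S_i^{t-2}, S_j^{t-1}](S_k^t) \qquad (12)$$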


- Learning: Like the first-order belief model, we can still use the $E^2M$ algorithm for
the belief HMM [?].


Since the belief HMM combines mass functions, and the previous observation is already
taken into account in the set of conditional bbas $m_a^{\Omega_t}[S_i^{t-2}, S_j^{t-1}]$,
we do not need to refine the observation probability for the second-order model as in
the second-order probabilistic model.
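
The conjunctive combination used to merge the two previous transitions can be sketched in Python with the unnormalised conjunctive rule. The function names and the two toy bbas are assumptions for illustration, and which conditional bbas are actually fed into the rule in a second-order belief HMM is not fixed by this sketch.

```python
def conjunctive(m1, m2):
    """Unnormalised conjunctive rule: m(C) = sum of m1(A) * m2(B) with A & B = C."""
    combined = {}
    for a, va in m1.items():
        for b, vb in m2.items():
            c = a & b                      # intersection of the two focal sets
            combined[c] = combined.get(c, 0.0) + va * vb
    return combined


def commonality_at(mass, subset):
    """q(subset) = sum of m(B) over focal sets B containing `subset`."""
    return sum(v for b, v in mass.items() if subset <= b)


# Hypothetical bbas standing for two successive transitions on the frame {N, V}.
N, V, NV = frozenset({"N"}), frozenset({"V"}), frozenset({"N", "V"})
m_first = {N: 0.6, NV: 0.4}
m_second = {V: 0.5, NV: 0.5}
m_both = conjunctive(m_first, m_second)
for focal_set, value in m_both.items():
    print(sorted(focal_set), value)
```

The mass assigned to the empty set measures the conflict between the two sources, and the commonality of the combined bba equals the product of the two commonalities, which is why the forward propagation is written with commonality functions.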


6 Conclusion and future perspectives 

The problem of POS tagging has been considered one of the most important tasks
for natural language processing systems. We described this problem based on HMMs
and applied our idea within the theory of belief functions. We extended previous works
on belief HMMs to the second-order model. Using the proposed method, belief HMMs can
be easily extended to higher-order models. Some technical aspects still
remain to be considered. A robust implementation of belief HMMs is required since,
in general, the training data for the problem of POS tagging contain over one million
observations. As described before, the choice of inverse pignistic transforms
should be empirically verified. We are planning to implement these technical aspects
in the near future.

Fig. 1. First-order probabilistic and belief HMMs

Fig. 2. Second-order probabilistic and belief HMMs

The current work relies on a supervised learning paradigm from labeled training data.
In fact, the forward-backward algorithm in HMMs can perform completely
unsupervised learning. However, it is well known that EM performs poorly in
unsupervised induction of linguistic structure because it tends to assign relatively equal
numbers of tokens to each hidden state [?]. Therefore, the initial conditions can be very
important. Since the theory of belief functions can take uncertainty and imprecision
into consideration, especially when data are lacking, we might obtain a better model
using belief functions in an unsupervised learning paradigm.